All, and only, the Errors: more Complete and Consistent Spelling and OCR-Error Correction Evaluation
نویسنده
چکیده
Some time in the future, some spelling error correction system will correct all the errors, and only the errors. We need evaluation metrics that will tell us when this has been achieved and that can help guide us there. We survey the current practice in the form of the evaluation scheme of the latest major publication on spelling correction in a leading journal. We are forced to conclude that while the metric used there can tell us exactly when the ultimate goal of spelling correction research has been achieved, it offers little in the way of directions to be followed to eventually get there. We propose to consistently use the well-known metrics Recall and Precision, as combined in the F score, on 5 possible levels of measurement that should guide us more informedly along that path. We describe briefly what is then measured or measurable at these levels and propose a framework that should allow for concisely stating what it is one performs in one’s evaluations. We finally contrast our preferred metrics to Accuracy, which is widely used in this field to this day and to the Area-Under-the-Curve, which is increasingly finding acceptance in other fields.
منابع مشابه
Design and implementation of Persian spelling detection and correction system based on Semantic
Persian Language has a special feature (grapheme, homophone, and multi-shape clinging characters) in electronic devices. Furthermore, design and implementation of NLP tools for Persian are more challenging than other languages (e.g. English or German). Spelling tools are used widely for editing user texts like emails and text in editors. Also developing Persian tools will provide Persian progr...
متن کاملOCR Post-Processing Error Correction Algorithm Using Google's Online Spelling Suggestion
With the advent of digital optical scanners, a lot of paper-based books, textbooks, magazines, articles, and documents are being transformed into an electronic version that can be manipulated by a computer. For this purpose, OCR, short for Optical Character Recognition was developed to translate scanned graphical text into editable computer text. Unfortunately, OCR is still imperfect as it occa...
متن کاملOCR Post-Processing Error Correction Algorithm using Google Online Spelling Suggestion
With the advent of digital optical scanners, a lot of paper-based books, textbooks, magazines, articles, and documents are being transformed into an electronic version that can be manipulated by a computer. For this purpose, OCR, short for Optical Character Recognition was developed to translate scanned graphical text into editable computer text. Unfortunately, OCR is still imperfect as it occa...
متن کاملDiploma Thesis: Unsupervised Post-Correction of OCR Errors
The trend to digitize (historic) paper-based archives has emerged in the last years. The advantages of digital archives are easy access, searchability and machine readability. These advantages can only be ensured if few or no OCR errors are present. These errors are the result of misrecognized characters during the OCR process. Large archives make it unreasonable to correct errors manually. The...
متن کاملAn Unsupervised and Data-Driven Approach for Spell Checking in Vietnamese OCR-scanned Texts
OCR (Optical Character Recognition) scanners do not always produce 100% accuracy in recognizing text documents, leading to spelling errors that make the texts hard to process further. This paper presents an investigation for the task of spell checking for OCR-scanned text documents. First, we conduct a detailed analysis on characteristics of spelling errors given by an OCR scanner. Then, we pro...
متن کامل